Data Visualization

Rene Perez.

Mini-Project 2

This data visualization project uses the California Housing Prices dataset, drawn from the 1990 Census of the State of California.

  1. California Housing Prices

The objective is to apply the tools and principles of data visualization to display and uncover trends, patterns, tendencies, and outliers. Using ggplot for R, this report covers the following outputs and tools:

  1. Data transformation using functions such as filter, select, group_by, and others.

  2. Bar charts, line charts, and others.

  3. Scatter plots, histograms.

  4. Dashboards.

  5. ggplot2 is the plotting library used.

  6. Coding language is R.

  7. RStudio is the integrated development environment.

  8. For spatial visualization, the package is sf.

  9. Fitting a linear regression model.
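The transformation and charting steps listed above can be sketched as follows. This is a minimal illustration, not the report's actual code: the six rows of `housing` are made-up stand-in values (the real dataset would be loaded instead), and the income cutoff of 3 is an arbitrary assumption. Only the column names come from the regression output in this report.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in rows, invented for illustration only.
housing <- data.frame(
  ocean_proximity    = c("INLAND", "INLAND", "NEAR BAY", "NEAR BAY",
                         "NEAR OCEAN", "NEAR OCEAN"),
  median_income      = c(2.5, 3.1, 5.2, 6.0, 4.4, 3.8),
  median_house_value = c(90000, 120000, 310000, 350000, 260000, 240000)
)

# filter -> select -> group_by -> summarise:
# keep districts above an (assumed) income cutoff, then average by category.
by_proximity <- housing %>%
  filter(median_income > 3) %>%
  select(ocean_proximity, median_house_value) %>%
  group_by(ocean_proximity) %>%
  summarise(mean_value = mean(median_house_value), .groups = "drop")

# Bar chart of the grouped summary.
bar_plot <- ggplot(by_proximity, aes(x = ocean_proximity, y = mean_value)) +
  geom_col() +
  labs(x = "Ocean proximity", y = "Mean of median house value (USD)")

# Scatter plot of the untransformed rows.
scatter_plot <- ggplot(housing, aes(x = median_income, y = median_house_value)) +
  geom_point() +
  labs(x = "Median income", y = "Median house value (USD)")

print(by_proximity)
```

Calling `print(bar_plot)` or `print(scatter_plot)` in RStudio would draw each chart in the Plots pane.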

Data Visualization.

Figure: California housing

Executive Summary (Linear Regression Analysis)

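  1. The model fits reasonably well: it explains about 65% of the variance in median_house_value (R² ≈ 0.65).

  2. All predictors are statistically significant at the 5% level, and all except ocean_proximityNEAR.BAY are also significant at the 1% level.

  3. median_income is the strongest positive predictor: a one-unit increase is associated with a median house value about 39,260 higher, holding the other variables fixed.

  4. Location features (longitude, latitude, ocean_proximity) are very important. For example, inland districts are valued about 39,284 lower than the omitted reference category, other variables held fixed.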

  5. Population and housing structure (total_rooms, total_bedrooms, households) affect value, but their individual coefficients may be distorted by multicollinearity [1].
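As a rough sketch of how such a model is fitted and summarized in base R, the example below uses simulated data rather than the census dataset. Every number in the simulation (sample size, slopes, noise level) is an assumption chosen for illustration, and the reference level of `ocean_proximity` is set arbitrarily; only the column and category names echo this report.

```r
set.seed(1)
n <- 200

# Simulated stand-in data (hypothetical values, not the census data).
sim <- data.frame(
  median_income   = runif(n, min = 1, max = 10),
  ocean_proximity = factor(
    sample(c("NEAR OCEAN", "INLAND", "NEAR BAY"), n, replace = TRUE),
    levels = c("NEAR OCEAN", "INLAND", "NEAR BAY")  # first level is the reference
  )
)
sim$median_house_value <- 50000 +
  40000 * sim$median_income -
  40000 * (sim$ocean_proximity == "INLAND") +
  rnorm(n, sd = 60000)

# Ordinary least squares fit; the factor is expanded into dummy variables.
fit <- lm(median_house_value ~ median_income + ocean_proximity, data = sim)
fit_summary <- summary(fit)

r_squared  <- fit_summary$r.squared
coef_table <- coef(fit_summary)  # columns: Estimate, Std. Error, t value, Pr(>|t|)

print(round(coef_table, 3))
print(round(r_squared, 3))
```

The rows of `coef_table` correspond to the entries of a regression table like the one in this report: one row per numeric predictor and one row per non-reference category.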

Linear Regression Output

Dependent variable: median_house_value

Variable                      Coefficient           Std. error
longitude                     -26,812.990***        1,019.651
latitude                      -25,482.190***        1,004.702
housing_median_age            1,072.520***          43.886
total_rooms                   -6.193***             0.791
total_bedrooms                100.556***            6.869
population                    -37.969***            1.076
households                    49.617***             7.451
median_income                 39,259.570***         338.005
ocean_proximityINLAND         -39,284.300***        1,744.258
ocean_proximityISLAND         152,901.900***        30,741.880
ocean_proximityNEAR.BAY       -3,954.052**          1,913.339
ocean_proximityNEAR.OCEAN     4,278.134***          1,569.525
Constant                      -2,269,954.000***     88,013.880

Observations                  20,433
R²                            0.646
Adjusted R²                   0.646
Residual Std. Error           68,656.950 (df = 20420)
F Statistic                   3,111.608*** (df = 12; 20420)
Note: *p<0.1; **p<0.05; ***p<0.01
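The significance stars and the adjusted R² can be cross-checked from the reported figures alone. The sketch below copies two coefficient rows, the residual degrees of freedom, R², and the observation count from the output above; the variable names are illustrative, and the count of 12 predictors is taken from the F statistic's first degrees-of-freedom value.

```r
# Values copied from the regression output.
estimate  <- c(median_income = 39259.570, ocean_proximityNEAR.BAY = -3954.052)
std_error <- c(median_income = 338.005,   ocean_proximityNEAR.BAY = 1913.339)
df_resid  <- 20420

# t statistic = coefficient / standard error; two-sided p-value from the t distribution.
t_value <- estimate / std_error
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# Adjusted R² from the reported R², sample size n, and number of predictors p.
r2 <- 0.646
n  <- 20433
p  <- 12
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(t_value, 2))
print(signif(p_value, 3))
print(round(adj_r2, 3))
```

The t statistic for median_income is about 116, far beyond any conventional threshold, while ocean_proximityNEAR.BAY has a t statistic of about -2.07, which puts its p-value between 0.01 and 0.05 and matches its two stars. With more than 20,000 observations and only 12 predictors, the adjustment to R² is negligible, which is why both values round to 0.646.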

  1. Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated with each other. They then carry overlapping information, which makes it hard for the model to separate the individual effect of each variable on the outcome.
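One common way to quantify multicollinearity is the variance inflation factor (VIF), computed for each predictor as 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below is a base R illustration on simulated data, not a diagnostic run on the census data: the predictors are generated so that total_rooms and total_bedrooms track households closely, and all distribution parameters are assumptions.

```r
set.seed(42)
n <- 500

# Simulated predictors (hypothetical values): rooms and bedrooms are built
# from households, so the three overlap heavily; income is independent.
households     <- rnorm(n, mean = 500, sd = 150)
total_rooms    <- 5 * households + rnorm(n, sd = 200)
total_bedrooms <- 1.1 * households + rnorm(n, sd = 40)
median_income  <- rnorm(n, mean = 4, sd = 1.5)

predictors <- data.frame(households, total_rooms, total_bedrooms, median_income)

# Pairwise correlations give a first look at the overlap.
print(round(cor(predictors), 2))

# VIF for each predictor: regress it on the remaining predictors.
vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

print(round(vif, 2))
```

A frequently cited rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity. In this simulation the three housing-structure variables should exceed that range, while median_income stays near 1.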